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Abstract 

In this paper we study convex stochastic search problems where a noisy objective function 
value is observed after a decision is made. There are many stochastic search problems whose 
behavior depends on an exogenous state variable which affects the shape of the objective func- 
tion. Currently, there is no general purpose algorithm to solve this class of problems. We use 
nonparametric density estimation to take observations from the joint state-outcome distribution 
and use them to infer the optimal decision for a given query state. We propose two solution 
methods that depend on the problem characteristics: function-based and gradient-based opti- 
mization. We examine two weighting schemes, kernel-based weights and Dirichlet process-based 
weights, for use with the solution methods. The weights and solution methods are tested on a 
synthetic multi-product newsvendor problem and the hour-ahead wind commitment problem. 
Our results show that in some cases Dirichlet process weights offer substantial benefits over 
kernel based weights and more generally that nonparametric estimation methods provide good 
solutions to otherwise intractable problems. 



1 Introduction 

Stochastic search is a class of stochastic optimization problems where we have to find a deterministic 
parameter to minimize the expectation of a function of uncertain quantities. The expectation is 
usually hard to compute, requiring instead the use of Monte Carlo samples. The problem is typically 
written 

miriE[F(a;,Z)], (1) 

where x € M"^ is the decision, X is a decision set, Z : 17 — t- \& is a random outcome and F x"^ ^ 
M is an objective function. A classic example is the newsvendor problem where we have to stock a 
quantity of product to serve an uncertain demand with an unknown distribution. Each iteration, 
we can only observe how much we sold, after which we make adjustments. A rich theory has evolved 
to address problems of this type (see fST]). 

In this paper, we introduce an important variation of the stochastic search problem. Assume 
that we are first allowed to observe a state variable S (such as whether it is raining or sunny) which 
changes our belief about the distribution of the random vector Z. After observing S, we then 
choose X, and only then do we observe Z, or we may only observe F(x,S,Z) (or its derivative). 
Each iteration starts with a new state, after which we choose an action and then observe the results. 
Since information from the current state, decision and observation is used to update beliefs for future 
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decisions, we have two challenges: 1) assembling information from previous state-decision-outcome 
pairs into something that can be used to make a decision for the current state and 2) finding the 
optimal decision given a state. 

If the state space is small (say, rainy or sunny) we can use classical methods from stochastic 
search by simply conditioning on the state when we do our updates. But it is often the case that 
5 is a vector, frequently with continuous elements. The hour ahead wind commitment problem 
is an example: a wind farm manager must pledge how much energy she will provide to a utility 
company an hour in the future. If too much energy is pledged, the difference must be bought; if 
too little is pledged, the difference is lost. The objective function depends on the future wind and 
market price, both unknown. The last 24 hours of wind and market prices, time of day and time 
of year all contain information about the objective function. This problem cannot be solved using 
standard techniques from stochastic search and stochastic optimization. 

To combat the first problem, sharing information across observations, we propose using non- 
parametric density estimation for the joint state and outcome distribution to group observations 
from "similar" states with weights. To combat the second problem, making a decision given an 
observed state, we use the weighted observations to construct convex, deterministic approximations 
of the conditional expectation of the objective function. Care is taken to ensure that the resulting 
optimization problems are computationally feasible. 

With this high level summary in mind, we turn to a more formal description of the problem 
setting. When we include the state variable, the stochastic search problem of Equation ([l]) becomes 



Note that the function F itself may change with the state. Conventional stochastic search techniques 
require us to sample from the conditional distribution p{Z\S = s), treating each state observation 
independently [57] . We use nonparametric density estimation for the joint distribution of {S, Z) to 
weight the states because similar values of S usually affect Z and F in a similar way. 

We propose a new model-free method to solve the stochastic optimization problem with an 
observable state variable. Our problem is motivated by online applications where we are given a 
state, and then after making a decision, we are given the realization of Z which depends on the state. 
For this reason, we index estimates and random variables, such as Sn, with a subscript that indicates 
at which iteration the value can be used. We use a nonparametric density estimate of (5, Z) to 
weight previous observations {Si, Zi)^^^ . Given an observed state, we generate an estimate of 
K[F(x, s, Z)], called the approximate function Fn{x\s), based on the weighted previous observations. 
For this paper, we are concerned with two classes of stochastic search solution methods for convex 
problems: 

• Function-Based Optimization: given a state Sn = s and an outcome Z{ujn+i) with ujn+i £ 
17, the entire response function F[x, s, Z{u}n+i)) is known. 

• Gradient-Based Optimization: given a state Sn = s, a decision xn and an outcome cj^^+i? 
we only observe the stochastic gradient /3(x„, s, Z{uJn+i)) = VxF{xn, s, Z{ujn+i))- 

The wind commitment problem can be solved by function-based optimization; once the wind speed 
at time n -|- 1 is known, the value for all possible commitment levels at time n is also known. In 
many problems the entire objective function is too expensive too compute or cannot be explicitly 
computed; however, it is often possible to observe or estimate derivatives around a decision value. 
For example, many resource allocation problems involve solving a linear program; the dual variables 
provide the gradients. For more complicated problems, function-based optimization may produce 



minE [F{x, s, 



Z)\S = s]. 
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functions too complicated for use in a solver. Gradient-based optimization eliminates this problem 
by restricting the form of the approximate function Fn{x\s) to piecewise linear, separable and 
convex. In both methods, however, we use weights derived from a joint distribution of (S, Z) with 
previous observations to form an approximate function, Fn{x\s). 

A large class of function-based methods currently exist for problems without a state variable |46| 
[M] . We craft a function-based optimization algorithm by extending the existing methods to 
weighted observations. We give conditions for almost sure convergence of this algorithm to a global 
optimum. 

Our approach to gradient-based optimization is less straightforward. Stochastic approxima- 
tion |45t |30] is the most popular gradient search method for problems without a state variable, 
but for reasons explained in Section [4j it cannot be extended to problems with a state variable. 
Instead, we try to construct a function that has the same behavior as the original objective function 
around the optimum. Like |3T], we use a separable, convex, piecewise linear approximation to do 
this, except that our previous observations are weighted according to the current state so that they 
produce appropriate slopes for the piecewise linear approximation. We show that the resulting 
algorithm converges to an arbitrarily small neighborhood around the optimum with probability 
one when the true objective function is itself separable. 

Both methods rely heavily on weighting functions. We give two methods to generate weights: 
kernels and Dirichlet process mixture models. Kernels are easy to implement and often give good 
results. They can develop problems, however, when the state variable is moderate to high di- 
mensional by giving all but a few observations weights that are effectively zero. This can lead 
to unstable results. As an alternative in these situations, we propose using weights generated by 
Dirichlet process mixture models. Dirichlet process mixture models are Bayesian nonparametric 
models that produce a distribution over data partitions. In effect, they cluster data in a Bayesian 
manner. We derive weights from this model by placing equal weight on all previous observations 
that are in the same cluster as the current observation. Then we approximate the average of these 
weights by taking a Monte Carlo sampling of clusterings. This method requires more work, but it 
is far more stable than kernel methods. We give conditions for when kernel and Dirichlet process 
weights satisfy the convergence criteria for both optimization algorithms. 

We test our methods on two problems, a two-product newsvendor problem and the hour ahead 
wind commitment problem. In the two-product newsvendor problem, we use synthetic data and 
compare both optimization methods under each weighting function. In the hour ahead wind com- 
mitment problem, we use synthetic price data and wind data from the North American Land Data 
Assimilation System. Due to the computational difficulties of computing weights and testing solu- 
tions every iteration, we only compare function-based optimization under the different weighting 
schemes. Dirichlet process weights produce better results for this problem. 

We contribute novel algorithms to include state variables in function-based optimization and 
gradient-based optimization problems. We study two methods to do this: kernel weights and 
Dirichlet process weights. This is a new use of Dirichlet process mixture models. We give empirical 
analysis for these methods where we show promising results on test problems. 

The paper is organized as follows. In Section [2| we review the treatment of search and optimiza- 
tion problems in the presence of a state variable in different communities. In Section [3} we review 
established function-based optimization methods, propose an algorithm that incorporates a state 
variable and prove convergence of that algorithm. In Section [4j we review current gradient-based 
optimization methods, propose an algorithm that incorporates a state variable and prove conver- 
gence of the algorithm under certain conditions. In Section [Sj we present two weighting schemes, 
kernel and Dirichlet based weights. We present an empirical analysis of our methods for synthetic 
newsvendor data and the hour ahead wind commitment problem in Section [6] and a discussion in 
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Section [7l 



2 Literature review 

Several communities, including operations research, optimization and machine learning, have stud- 
ied problems with forms similar to the stochastic search problem with a state variable of Equation 
([2]). The problems and solution methods are diverse; even within communities, optimization prob- 
lems with a state variable are never treated as an entire problem class. We briefly outline the 
resulting hodgepodge of problems and methods. 

Most commonly, stochastic search problems with a state variable are considered individually 
rather than as an entire problem class. The prevailing approach is to construct a model for Z and 
use the information in S to supply the parameters. In the case of wind farms, ^56j constructs a 
model for the wind and use a state of the world to determine parameters. They also select which 
state variables are needed to parameterize the wind distribution. Collectively, creating a model 
and selecting state variables to generate parameters for that model can require substantial time 
and domain knowledge. 

In some areas, problems with state variables have become their own classes. In the statistics 
and operations research community, there has been some study of portfolio and bandit problems in 
the presence of a state variable (called a "covariate" in the bandit literature and "side information" 
in the learning theory community). [9j and |25j studied portfolio optimization with a finite state 
variable, but handled it in a manner that amounts to treating each value as a separate problem. 

Bandit problems with a state variable are another established area of study. A bandit problem 
is a sequential decision problem with small, finite set of statistical populations (arms). At each 
iteration, only one arm can be sampled; a random reward Ri is obtained with probability pj, where 
i denotes the arm number. When a state variable 5 = s is included, the probability of a reward is 
Pi{s). The goal is to maximize the average reward. This bandit variant was first introduced by |64j 
and has been studied when the distributions are assumed to have a parametric form [50\ [62l [T9] . 
They have also been studied where the mean function has been estimated in a nonparametric 
fashion |65l I44j : however, decision-making mechanisms vary widely due to the non-convex decision 
set. 

In the optimization community, parametric nonlinear programming includes what can be viewed 
as a state variable in a math programming setup. The basic problem has the form 

mini^'fa;, s), 

where s "parameterizes" the program. Such problems have been used for sensitivity analysis |29| 
133] and have been the focus of renewed interest in the model predictive control community [3]. 
Parametric nonlinear programming, however, is deterministic and it assumes that F is known for 
a given s. 

The machine learning community solves problems with state variables more than any other 
community. Machine learning is a catch-all term for a large set of subfields, including learning 
theory, reinforcement learning, classification, Bayesian nonparametrics and many others. Problems 
with a state variable arise in some of these subfields, such as reinforcement learning and learning 
theory. 

State variables arise in a general way in dynamic programming and reinforcement learning. In 
these problems, we might be in state S and take an action o, which determines, or influences, the 
next state S' that we visit. The choice of the action a, then, needs to consider the expected value 
of being in state S' . A host of algorithms have been proposed to solve this problem class (see, for 
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example, [121 [58l |6U1 HOI S]), where the major complication is that we do not know, and have to 
approximate, the value of being in state S' . Our problem class is closest to the field of stochastic 
search [57j, where we often have to face the challenge of vector- valued (and possibly continuous) 
decision vectors x. Of all the algorithms in reinforcement learning, our problem is most similar to 
Q-learning which requires estimating the value of Q{S,a) (in our notation, Q{S,x)). Although we 
do not have to deal with the value of the downstream state, we do have to address the challenge 
of complex state variables and vector-valued decisions, which the Q-learning literature has not 
addressed. 

State variables are also common in learning theory, where they are called "side information." 
The portfolio references and many of the bandit references are written from a learning theory 
perspective. The work closest to ours is [23]; they incorporated a state variable into an online 
convex optimization problem. In this setting, each iteration a player selects a decision from a 
convex set and an adversary selects a loss function from a finite set of options. The state variable 
contains some information about the set of loss functions. They construct an algorithm that 
minimizes regret, a notion of loss under a worst-case scenario, rather than expected loss. They 
propose using a combination of a nearest-neighbor and e-net mapping from the state space to the 
decision space under smoothness constraints. Values are updated by gradient observations. Their 
algorithm does not converge to a fixed decision for a given state with more than one possible loss 
function. 

We now turn to the first of our methods, function-based optimization. 

3 Function- based optimization with an observable state variable 

We use function-based optimization when a single outcome co can tell us the value of all decisions 
given that outcome 146 1 ^55 1 154|. For example, in the hour ahead wind commitment problem, if 
the wind is known, then the value of all commitment levels is known. Function-based optimization 
relies on sampling a set of scenarios, coi, . . . ,ujn from J7, to approximate the expectation: 

^ 1=1 

Equation Q is deterministic and deterministic methods can be used. They can accommodate 
complex constraint sets but require that the entire function is known for an outcome uj. The 
following asymptotic results hold for function-based optimization. Let x* be the true solution 
and X* be the solution to Equation Q. Under sufficient conditions, X ^ Xi * almost surely as 
n — )> DO [46 1 139 t l20j. Moreover, [53] and [1^ show asymptotic normality under stricter conditions 
than those required for convergence. 

Function-based optimization has been well studied, but under a variety of names. [l6], [55] , 
and [20] call it sample path optimization; |49] use likelihood ratios to approximate the optimization 
problem, calling it the stochastic counterpart method; [24J studied this method and call it retro- 
spective optimization; within the stochastic programming literature, it is often known as sample 
average approximation [541 [3T] and scenario optimization [6J . A similar method has been used in 
discrete event systems and gradient estimation, called perturbation analysis 117]. 

3.1 Algorithm for function-based optimization 

Extensions of existing algorithms to include a state variable are fairly straightforward. New obser- 
vations of F have the form F(x, 5^, Z(a;j+i)) where Si is a random state variable. It is difficult to 
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place a prior over the space of all convex functions or create a joint distribution over states and the 
space of convex functions. Therefore, we develop a locally weighted average of the observations to 
approximate the true mean function at a given state. 

Let s G 5 be a fixed query state. We would like to minimize K[F{x, s, Z)] with respect to x. 
Let {Si, Z{uji^i))^~Q be a set of n observations and {wn{s, Si))^~Q be a set of weights based on the 
query state and the observed states where 

n-1 

'^Wnis,Si) = 1. 
i=0 

The weight functions may change with the number of observations n. Set 

n-1 

Fn{x\s) = ^ Si)F{x, Si, Z{uji+i)). (4) 

i=0 

The optimization problem becomes 

min Fn{x\s). (5) 

This produces an average of observations weighted by how close the observed states are to the 
current state. F{x, Si, Z{uji+i)) is convex in x for every Si and tOi+i, so Fn{x\s) is convex. This 
is particularly helpful because Equation ([s]) can then be solved with any number of deterministic 
solvers. We give the general procedure in Algorithm [TJ 

Algorithm 1: Function-based optimization with an observable state variable 
Require: Query state s. 
1: for i = to n — 1 do 
2: Observe random state Si. 
3: Observe random function F{x, Si, Z{uji^i)). 
4: end for 

5: Generate weights {wn{s, Si))^~Q . 
6: Set 

n-1 

Fnix\s) = ^ Wnis, Si)F{x, Si, Z{uJi+i)). 
i=0 

7: Solve 

X* (s) = min Fn{x\s). 



Define x* (s) as 

a;*(s) = argmin Fn{x\s), 

x&X 

and the true optimal decision x*[s) as 

X* (s) = arg min "KlF {x , s , Z) \ S = s\ . 

x^X 

Let E [F{x, s, Z)] = F{x\s). We would like the approximation to have the following property, 
lim X* (s) = lim argminF„(x|s) = argminF(x|s) = x*{s), almost surely. 

n—>-co n— 5>oo x£X x^X 

We discuss convergence properties in the following subsection. 
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3.2 Convergence analysis 

Let x*^{s) be the solution to Equation ([s]) and let x*{s) be the true solution. We would like 
2;*(s) — )■ x*{s) almost surely, pointwise in s, for s in a compact subset of the state space, denoted 
Sd- In Theorem |3.1[ we give conditions under which almost sure convergence occurs. 
Before we state the main theorem, we give a set of assumptions. 

1) A' is a convex subset of M'^ and Sd is a compact subset of S. 

2) The function F(x,s,Z{uj)) is almost surely convex and continuous in x € X for every 



(a: 

S £ Sd- 

(A[3|3) For every fixed s £ Sd and x £ X, let (u'„(s, 5j)o:n)^o distribution of {Z,S) 

be such that 

n-l 

lim Wn{s, Si)F{x, Si, Z{uji+i)) = F{x\s). 

1=0 

(A[3|4) For every fixed s G Sd, the function K[F{x, s, Z{u!)] has a unique minimizer. 

Assumptions (A|3jl)-(A[3j2) places bounds on the decision set and state set. Assumption (AjsjS) 
places a pointwise convergence condition on the functional estimator; this places restrictions on both 
the weighting functions Wn{s,Si) and the distribution of Z\S = s. See Section [5] for a discussion 
on this assumption. Assumption (A[3j4) assures that there is only one optimal decision per state. 
The main convergence theorem for function-based optimization is as follows. 

Theorem 3.1. Let Fn{x\s) = ^"^^^ tt;„(s, S'j, Suppose that assumptions (J^l)- 

(J^4) hold. Then, < (s) —7- x*{s) almost surely, pointwise for every s £ Sd- 



The proof of Theorem |3 . 1 1 relies heavily on Proposition 2.4 and Theorem 3.7 of [l6]. We state 
them now, modified for our setting. 

Proposition 3.2 (Proposition 2.4 of [^). If, for every s £ Sd, Fn{x\s) converges uniformly to 
F{x\s) on all compact, non-empty subsets of X , then Fn[x\s) epiconverges to F{x\s). 

Theorem 3.3 (Corollary 3.11 of [46j). Suppose that Fn{x\s) epiconverges to F{x\s) and that F(x\s) 
has a unique minimizer for a fixed s £ Sd- Then x^{s) converges almost surely to x*{s). 



Proof of Theorem 3.1. Fix s in Sd- We show almost sure convergence by satisfying the condi- 
tions of Corollary 3.11 in |46j . First, we show that Fn{x\s) — )• F{x\s) uniformly for x £ Xd, where 
Xd is a compact subset of X. Because Fn{x\s) is bounded and continuous, it is equicontinuous; 
because it is equicontinuous and converges pointwise in x to F{x\s), Fn{x\s) converges uniformly 
to F{x\s). Continuity of Fn{x\s) and uniform convergence to F{x\s) satisfy the conditions of 
Proposition 2.4 of [l6], which in turn satisfies the conditions of Corollary 3.11. 

We discuss the choice of weight functions in Section [5| Before that, however, we present an 
algorithm and theoretical results for gradient-based optimization with a state variable. 

4 Gradient-based optimization with an observable state variable 

In gradient-based optimization, we no longer observe an entire function F{x, Sn, Z{uJn+i)), but only 
a derivative taken at Xi, 

P[Xyi, Sn, 
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Stochastic approximation is the most popular way to solve stochastic search problems using a 
gradient; it modifies gradient descent algorithms to account for random gradients |45| I30j . The 
general idea is to optimize x by iterating, 

Xn+l = Tx {Xn - anVx F{Xn, Z{uJn+l))) , (6) 

where is a projection back into the constraint set X, Vx F{xn, Z{u}n+i)) is the stochastic gradient 
at Xn and an is a stepsize. Another approach to gradient-based optimization uses construction of 
piecewise linear, convex functions to approximate F{x) [18]l41j: we will follow the second approach. 

Including a state variable into gradient-based optimization is less straightforward than it is for 
function-based optimization. We encounter difficulties because we choose x„ given Sn, Equation 
(|6]) works because only Xn changes. When we include an observed state Sn, the decision Xn is 
based on the state S„. Therefore, it cannot be chosen in an iterative manner directly from Xn—i, 
which is based on the state Sn-i- Additionally, constructing the approximate function Fn{x\s) in 
a convex manner is not trivial because the gradient observations are based on both x„ and Sn- In 
this section, we give an algorithm for gradient-based optimization with a state variable, along with 
convergence analysis for that algorithm. 



4.1 Algorithm for gradient-based optimization 

We propose modeling F{x\s) with a piecewise linear, convex, separable approximation. Even if 
F{x\s) is not itself separable, we aim to approximate it with a simpler (separable) function that has 
the same minimum for every fixed s. Approximating the minimum well is easier than approximating 
the entire function [Bjdl]- Moreover, convex interpolation is easier in one dimension than multiple 
dimensions. We approximate E[F(x, s, Z)] by a series of separable functions, 

d 

Fn{x\s) = Y,fn{As), 
k=l 

where x'^ is the k^^ component of x and f^{x\s) is a univariate, piecewise linear function in x. 
We enforce convexity restrictions on marginal functions f,ll{x\s) for every s £ S. We assume the 
existence of stochastic gradients, f3{x,s,uj) = VxF{x, s, Z{u))), which are obtained as a response 
instead of F{x, s, Z{uj)). 

The observations {xi, Si, (3{xi, Si,uJi+i))'^~Q are used to update Fn{x\s) sequentially. We want 
to assemble a set of d piecewise linear marginal functions /^(x|s) by constructing a series of slopes, 
VQ.n_i{Sn), based on (3i-n- We use weights to group the gradients from states "similar" to 5„. We 
outline the algorithm as follows. 

Step 1: Observe 5„ and generate {wn{Sn, Si))^~Q This is discussed in Section [s] 

Step 2: Construct slopes for /^(xjS'n) given xo-.n-i and {wn{Sn, Si))'1^Q Fix k. We 

begin by placing the observed decisions in ascending order: 

^[0] ^ ^fl] ^ ■ ■ ■ ^ ^fn-l]' 

where [0], . . . ,[n — 1] is the ordered numbering. A necessary and sufficient condition for f^{x\Sn) 
to be convex is for the slopes to be nondecreasing; that is. 
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for every x < y. We find a set of slopes [o]{Sn) < • • • < [n-ij^*^") corresponding to the ordered 
decisions x^, . . . ,x^^_^ using weighted least squares minimization, which is a quadratic program, 



n-l 



v'^{Sn) = argmm ^?i;„ {Sn, S^) (^^{x'^^, 8^,(^1^^+^) - v^i^^ , (7) 
subject to : f [i-i] < , z = l,...,n — 1. 

Step 3: Reconstruct /^(x|S'„), Fn{x\s) given v^{Sn) Suppose that X is compact; there exists 
a minimum value x^^^ and a maximum 

^fnl ~ ^max- Define as follows. 



a minimum value and a maximum value x'^^^ for each dimension k. Set x^_^ = x^^^ and 



= Y,vt^]{Sn) (4 - Xf,„i]) + vt[,j{Sn) {x - 4,), (8) 



i=0 

where £ is the smallest index such that xj^^j < x < x^^^j^j. Set 



fc=l 

Note that the reconstruction is the same as the original up to a constant, which does not affect the 
optimal decision. 

Step 4: Choose x„ given Fn{x\Sn) We want to choose an Xn so that we learn as much as 
possible for an arbitrary s. This is done by picking follows, 

Xn = argmin F„(x|5n). (9) 

Note that Fn is a piecewise linear function; if A' is a linear constraint set, the minimum can be 
found with a linear program. 

A full overview of this procedure is given in Algorithm [2] Notice that while the function-based 
optimization of Algorithm [T] can essentially be performed in a post hoc batch setting. Algorithm [2] 
can only be performed in an online setting. We discuss a grid-based extension of Algorithm [2] in 
the following subsection. 

4.2 Grid-based decisions 

One of the more computationally heavy parts of Algorithm [2] is Step 6, the projection of the slopes 
to an ordered space via a quadratic program. The number of parameters and constraints grows 
linearly with the number of observations. In some numerical work, we have found it easier to make 
decisions on a grid format. 

Fix k. If is compact, we can create an arbitrarily fine grid on the fc*'^ dimension with a finite 
number of points, 

fli < • • • < ^^{fc)- 

Suppose that they are evenly spaced with distance a and let the intersection of this set of points 
with X be denoted . If all decisions are selected from X'^, the parameter and constraint set for 
Equation ([7]) never grows. 
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Algorithm 2: Gradient-based optimization with an observable state variable 
Require: Query state s, initial slopes vq. 



for z = to n — 1 do 

Observe random state Si. 

Generate weights {wi{Si, Sj)yj^Q. (See Section 5 ) 
for /c = 1 to d do 

Place decision observations in ascending order: < • • • < 

Compute slopes v'^{Si) by 



i-l 

Vi{Si) = argmm^Wi {Si, Sy]) (^^{xfj^ S[j],ujy^i]) - vy] 

3=0 

subject to : ^[j-i] < ^[j], j = l,...,i — 1. 



7: Reconstruct marginal function ff{x^\Si) using slopes vf{Si) as per Equation ([s] 
8: end for 
9: Set 

d 

Xi = argmin fl;{x''\Si). 
k=l 

10: Observe random gradient /3{xi, 5j, Wj+i) = VxF{xi, Si, Z{uji^i)). 
11: end for 

12: Compute v^{s), k = 1, . . . ,d as in Step 6. 

13: Compute fnix'^ls), k = 1, . . . ,d using v^{s) as in Step 7. 

14: Set 

d 

= argmin V/^(x^ I s). 

xGPc '■ — ' 
fe=l 



The inclusion of a grid changes the way that we select Xn- Let Xn be the solution to the original 
approximated problem, 

Xn = minF„,(x|S'„). 

We can generate Xn from x„ in one of two ways: 1) projection to the nearest feasible point in X'^ , 
or 2) random selection of a neighboring point in X'^ . The second option is computationally simple 
and can be guaranteed to break the constraints by at most an arbitrarily small amount through 
grid construction. We use the second method in our numerical examples. 

4.3 Convergence analysis 

We now give conditions under which Algorithm [2] converges in probability to the global optimum 
pointwise for every state s is a set of query states 5d- The observed state Sn is likely not the same 
as our query state s, but we often care about what the approximation says is the best decision for 
s after n observations, defined as x* (s). Set 

a;*(s) = argmin 
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We would like a;*(s) to approach the true optimal decision, as n — )• oo, where 

x*{s) = argmin E [F{x, s,Z) \ S = s] = argmini^(x|s). 

The outline of the proof is to show that the decisions sampled for a sufficiently small neighborhood 
of states around s accumulates in an arbitrarily small neighborhood around x*{s). This is done by 
verifying optimality conditions in the accumulation regions. First, however, we need to define a set 
of assumptions. 

Suppose that there are a finite set of functions gi{x), i = 1, . . . ,p and hj{x), j = 1, . . . ,q such 
that the constraint set X can be written as 

X = {x : gi{x) < 0, hj{x) =0, i = 1, . . . ,d, j = l,...,q} . 

The following conditions require separability and strong convexity of the objective function, 
along with differentiability of the objective function and constraints. 
(A[4|1) For every s G 5, the function F{x\s) is separable in x, 

d 

F{x\s) = J2fHx''\s). 

k=l 

Define the gradient function 

d 

v''{x'',s) = Q^fHx^\s)- 
(A[4|2) Let F{x\s) be strongly convex in x for every s G 5 with parameter m; that is, 
{V^F{x\s) - VyF{y\s)f {x - y) > m\\x - y\\l. 



(A 
(A 



3) F{x\s) is twice continuously differentiable in x and s for every x ^ X and s E 5. 

4) If the state space and decision space are well sampled around the point (x, s) and every 



j3{x',s) =v{x,s), 



observation is unbiased, that is, 

E 

with f (x, s) as in (A|4|l), then 

lim Vn[x, s) = v{x, s), 

n— >oo 

where Vn{x, s) is defined by the projection in Equation ([7]). 

Now we discuss the optimality conditions. Define Nx{x), the normal cone to X at x, 



Nx{x) = ^y(^W^\{y,x-v)> 0, G A'j . 



Define the subgradient of F{x\s) at x as dF[x\s). Then, the point x* is a global minimizer of 
F[x\s) for a fixed s if and only if 

QedF{x*\s) + Nx{x*). (10) 
We aim to show that as n — )• oo, x* (s) and only x* (s) satisfies Equation ([To]). 

Theorem 4.1. Let Sd be a compact subset of S and assumptions (J^1)-(J^4) hold. Then, for 
every s G Sd and every e > 0, 

P{|x:(s)-x*(s)| >e}^0 

as n ^ oo. 
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Proof. Fix s G Sd and e > 0. Consider the k^'^ component of the decision variable. Since X is 
compact, there exists an and a;^^^, such that < < x^^^, for all x G A". We place an 
e/4md-net on this axis such that x*'^{s) is in the center of one of the partitions, where m is the 
strong convexity constant. Label the regions ai, . . . ,a'^^i.y This is done for all components of the 
decision variable, k = 1, . . . ,d. 

Note that for every iteration, the decision Xn and the calculated optimum x* (s) are random 
variables. Let p^{ai,s) be the probability that Xn'^^s) G a^, 

p^(a,, s)=F {x;''=(s) G af } , i = 1, . . . M{k). 

Fix 7 > 0. Let C''{s) be the set of partitions where there exists a subsequence (rij)^]^ such that 

^Hs) = \af\ lim inf (af,s) >7 

That is, is the set of all decisions sampled infinitely often with at least limiting probability 7. 
Let 



A^(s) = U 



k 



tti - x*'''{s] 



< e/d} . 



We will show that C^{s) is non-empty and that C A^{s); this is done in two phases. 

Claim 1: C''{s) / 0. If 7 < l/M{k), then at least one decision will be sampled infinitely often 
with at least probability 7. 

Claim 2: >C'^(s) C A^{s). Suppose b G >C^(s) and b ^ A^{s). Then, there exists an infinite 
subsequence {rii)^-^ such that x* (s) G b. Since the marginal values, v„(x, s) are also continuous in 
s, there exists a set of s' such that lim inf p^^. (6, s') > 7 and \v^{x, s) — v^{x, s')\ < e/Amd for every 
X € b. Denote this set by B{s,b). The states in set B{s,b) will be sampled infinitely often, so by 
(A|4j2) and (A|4]4), there exists an N such that for every n > N, 

f^(x, s) — v{x, s) < m* e/2md = e/2d. 

But by (A[4|2) and (A|4j3), the function x*{s) is uniformly continuous in s over Sd- Combining this 
fact with (A|4jl), for sufficiently small e, 

^ Vn{x,s) + Nx{x) 

for all n > and all x G 6. Therefore, b ^ >C'^(s), so Claim 2 is true and therefore the theorem 
holds. □ 

We now discuss the choice of weight functions. 



5 Weight functions 

The choice of weight functions determines whether assumptions (A[3j3) and (A[4|4) are satisfied and 
which distributions of -F(x, s, w) satisfy them. More importantly, however, the weight functions also 
determine how well Fn{x\s) approximates F{x\s) with finite sample sizes. Before we discuss the 
specifics of individual weighting functions, let us discuss how weighting functions are constructed. 

Weighting functions rely on density estimation procedures to approximate the conditional den- 
sity f{y\s), where s is the state and y is the response. Then the mean conditional response 
E[y|s] = / yf{y\s)dy is calculated, which is the object of interest. The conditional density is 
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approximated by a weighted sum of observations, /n(y|s); it is either expressly composed by ob- 
servational weights, as in the case of kernel regression, or can be decomposed into observational 
weights, as in the case of Dirichlet process regression. 

In this section, we discuss two weighting schemes in detail: kernels and the Dirichlet process 
similarity measure. Kernels been well studied and are easy to implement as a weighting scheme. 
However, they often do not perform well with more complicated problems, such as those with 
a moderate or large number of covariates. Therefore, we propose a Dirichlet process similarity 
measure when a richer class of weighting functions is required. We discuss kernel weights in Sub- 
section 5.1 and the Dirichlet process similarity measure along with relevant background material 
in Subsection 15.21 



5.1 Kernel weights 

Kernel weights rely on kernel functions, K(s), to be evaluated at each observation to approximate 
the conditional density. A common choice for K with continuous covariates is the Gaussian kernel, 

Kh{s) = {2nh)~^/^exp{-s^/2h}, 

where the variance h is called the bandwidth. Kernel weights have the advantage of being simple 
and easy to implement. The simplest and most universally applicable weighting scheme is based 
on the Nadaraya- Watson estimator [33 If -^(s) is the kernel and /i„ is the bandwidth after n 
observations, define 

In the case of function-based optimization, the function estimate Fn{x\s) is 

Er=o^ Khn {s - Si) F (x. Si, Z{uJi+i)) 



FJxls) 



ELo (s - Sj) 



Kernel methods have few requirements to satisfy assumptions (A[3j3) and (A|4j4). For function- 
based optimization, assumption (AjsjS) is satisfied if: 

1. F(x, s, Z) has finite variance for every x £ X , s £ S . 

2. F{x\s) = E[F(x, s, Z)] is continuous in s for every x £ X. 
Sufficient conditions for gradient-based assumption (A|4]4) are similar: 

1. /3(x, s, Z) has finite variance for every x £ X , s £ S. 

2. v{x, s) = V^E[F(x, s, Z)] is continuous in s and x. 

The first condition is assured by assumption (A[4|3), so only the second condition must be checked. 

Despite ease of use and a guarantee of convergence, kernel estimators require a well-sampled 
space, are poor in higher dimensions and are highly sensitive to bandwidth size. There is a large 
literature on bandwidth selection [12l |28j , but it is usually chosen by cross-validation. To overcome 
many of these difficulties, we propose using a Dirichlet process mixture model over the states as an 
alternative weighting scheme. 
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5.2 Dirichlet process weights 

One of the curses of dimensionality is sparseness of data. As the number of dimensions grows, the 
distance between observations grows exponentially. In kernel regression, this means that only a 
handful of observations have weights that are effectively non-zero. Instead of basing weights on 
a distance that grows quickly with the number of dimensions, we would like to average responses 
for "similar" observations. Regions of "similarity" can be defined with a clustering algorithm. 
Dirichlet process mixture models (DPMMs) are Bayesian nonparametric models that produce a 
distribution over data partitions [38l[27j. They were introduced in the 1970's [l3l[7l[I], but have 
gained popularity in the last fifteen years as more powerful computers have allowed posterior 
computation [32l [TT] . DPMMs have been used for classification [52] , clustering [Ml EO] and density 
estimation |11[ [33] , but we will use the partitioning feature directly. 

Dirichlet process mixture models are easiest to understand in the context of density estimaiton. 
The idea is that complicated distributions, such as the distribution of the state variable, can be 
modeled as a countable mixture of simpler distributions, such as Gaussians. To make the model 
more flexible, the number of components is allowed to be countably infinite. The Dirichlet process 
is a Bayesian model that places a prior on the mixing proportions, leading to a small number of 
components with non-trivial proportions. Because an infinite number of components are allowed, 
the Dirichlet process circumvents the issue of determining the "correct" number of components, as 
is necessary in algorithms like k-means. 

The clustering/partitioning property of the Dirichlet process is derived from the mixture as- 
sumption. Two observations are in the same cluster or partition if they are generated by the same 
mixture component. The query state s is also placed in a cluster; the estimated response for s is 
simply the average of all the responses associated with states in that cluster. However, because 
the data labels, component locations and mixing proportions are not known, the Dirichlet pro- 
cess produces a distribution over clusterings. We use Monte Carlo methods to integrate over the 
clusterings. 

In this subsection, we discuss the basic properties of Dirichlet process mixture models, how 
they can be used to generate a weighting function, how it can be approximated, and finally what 
is required for it to satisfy the optimization algorithm assumptions. 



5.2.1 Dirichlet process mixture models 

A mixture model represents a distribution, gois), as a weighted sum of simpler distributions, 
g{s I 6i), which are parameterized by 9i, 

K 

9o{s) = ^Pig{s\9i). 

1=1 

Here, pi is the mixing proportion for component i. For example, if go{s) is a univariate, continuous 
distribution, we may wish to represent it as a sum of Gaussian densities, 

g,is) = ^p,^=e " . (11) 

i=i \ n-Ka"} 



In Equation (11), g is the Gaussian density; it is parameterized by 9i = {^i,af), the mean and 
variance for component i. 

The difficulties of using a mixture model are 1) determining parameters {pi,Oi)^i, and 2) 
determining K. There are many optimization based algorithms to find {pi,9i) (see |21] for a 
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review), but fewer ways to find a good value for K. However, if we assume K = co with only a 
finite number of components with weights pt that are effectively non-zero, then we effectively do 
not have to choose K. 

We can use a Dirichlet process (DP) with base measure Go and concentration parameter a to 
place a distribution over the joint distribution of {pi,6i), the mixture proportion and location of 
component i \1'6\ [T] . A Dirichlet process effectively places a discrete probability distribution over 
the parameter (6) space; the 6''s that are given positive probability in that space are components 
of the mixture model. The probability associated with 9 is the component weight p. We shall call 
this distribution over the parameter space P. Note that a Dirichlet process prior does not tell us 
deterministically what {pi,Oi) will be; instead it places a distribution over what it could be. For 
that reason, P is actually a random measure. A Dirichlet process mixture model is constructed as 
follows. 

Assume that data Si,...,Sn are drawn from the same distribution, which is modeled by a 
mixture over distribution G{6). We let g{-\0) be the density, while G{9) is the distribution, for 
example A^(^, cr^). Observation Si is drawn from a component of that model, G{6i), with parameter 
9i. Conditioned on 6i, Si has the distribution G{6i). Now, let P be the mixing distribution over 6] 
we give P a Dirichlet process prior with base distribution Go and concentration parameter a. In 
sum, this produces a Dirichlet process mixture model. We use the following hierarchical model for 
the DPMM, 

P~Z)P(a, Go), (12) 
ei\Pr^P, 

Si\ei^ Gie,). 

Here, "X ~ Y" means "X has the distribution of Y." Note the conditional independence at every 
level of the Model ( 12 ); for example, given 6i, Si is independent of P and the other Sj. Distributions 



F and Go often depend on additional hyperparameters; these will be explained in context later. 
A Dirichlet process is used as a prior on P because it produces an almost surely discrete 



distribution over parameters. This is demonstrated when we integrate out P from Model (12) to 
obtain a conditional distribution of 9n\0i;n-i Ui 



n-l 

i 

h,- - ■ , fn-l ~ 



^ n— 1 

n-l-^ a + n-1 



Here, 6$ is the Dirac measure with mass at 6. Equation (13) is known as a Polya urn posterior. 
Note that On has positive probability of assuming the value of one of the previously observed 6i , but 
it also can take a completely new value drawn from Go with positive probability. The parameter a 
controls how likely 9n is to take a new value. 

We now give an example of a DPMM. Suppose that go{s) is univariate and continuous. An 
infinite Gaussian mixture model is a good approximation, parameterized by 9i = {fii,af). Let 
Si, . . . ,Sn be drawn from this distribution. The mixture model can be written as, 

P~L>P(a,Go), (14) 
9, = {fii,af)\P^P, 

S^\9^^ N{fii,af). 

Often, Go is chosen to be conjugate to G to ease posterior sampling; in this case, the conjugate Go 
is Normal-Inverse-Gamma with hyperparameters {\o,i^Q,ao, /3q). a is also a hyperparameter; it is 
usually given a Gamma prior or set equal to 1. We now discuss DPMM weights. 
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5.2.2 The Dirichlet process weights 



Dirichlet process mixture models intrinsically produce a partition structure |38|, I27| . that is, a 



clustering of the observed data. We can see this in the Polya urn posterior of Equation (13); each 



hidden parameter has positive probability of taking the same value as another parameter. If two 
parameters have the same value, the associated observations are in the same partition/cluster. 
Because a Dirichlet process places a distribution over component parameters and probabilities, 
which form a partition of the data, it also places a distribution over all partition structures of the 
data. We use the partition structure to induce weights on the observations by giving equal weights 
to all observations that are in the same partition as the current observed state. 

Let the cluster /partition d be defined as the set of all observations that have the same param- 
eter, Ci = {j : 9j = 6*}. Let p = {Ci, . . . , C„(p)} be the partition of the observations {1, . . . , n}. 
Given a partition p, there are n{p) clusters, generating n(p) unique parameter values, 9^, . . . , (^n{p)- 
Now suppose that we know the partition p. Given this partition, we wish to place our observed 
state s into one of the partitions. We do not know its partition, but we can generate a probability 
that it is in cluster Ci, 



Ps{Ci\p)=F{s€Ci\p,S: 



l:n) 



OC \Ci 



g{s\e*)dHc,{e*), 



(15) 



where \Ci\ is the number of elements in Q, Hci{9*) is the posterior distribution of 9* conditioned 
on the base measure Go and the set of observations {Sj : Sj S Ci}. Sometimes it is impossi- 
ble to compute the integral in Equation (15) and it is approximated by Monte Carlo integration 
conditioned on 9*. 

The probability is calculated for each cluster Ci, i = l,...,n(p). Given the probabilities 
(ps(Cj|p))"i^^\ the weighting function is defined by 

n(p) 



•n{s,Si) I P = ^ 



Ps jCj I p) 

I C.I 



(16) 



Equation (16) is different from the conditional Polya urn posterior of Equation (13): unlike the 



observed states, the query state is not allowed to be in a cluster by itself. We do this because we 
often do not have prior information on the response distribution. 



Equation (16) is conditioned on a partition structure, but the Dirichlet process produces a 



distribution over partition structures. Let 7r(p) be the partition probability function, which is a 
prior distribution for partitions p. 



7r(p) 



n-=i(«+j) 



The posterior distribution, 7r(p|5'i:n) has the form 

n(p) 



7r(p|S'l:n) OC 7r(p) 



(17) 



(18) 



j=l •''T i(^Cj 



We can combine Equations ( 16 ) and ( 18 ) to obtain unconditional weights 



(19) 
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The conditional weights in Equation ( 16 ) are easy to compute, but it is nearly impossible to compute 



7r(p|S'i:n) or even enumerate all possible partitions, p. Therefore, we approximate Equation (19) by 
performing a Monte Carlo integration over the partitions. We obtain M i.i.d. posterior partition 
samples, (p^™'^)^=i a-iid set 

»"(».S)-SE E "'^^"^ ^ Hs.,C,r (20) 

m=l jr=l 

We now show how to obtain (p^™^)^=i given ^i, . . . , S„ using Gibbs sampling. 
5.2.3 Gibbs Sampler for the State Variable 

Markov Chain Monte Carlo (MCMC) [36] is the most popular and simple way to obtain partition 
structure samples, (p^'"-')m=i- MCMC is based on constructing a Markov chain that has a limiting 
distribution equal to the partition structure posterior distribution. The most common way to 
implement MCMC is by Gibbs sampling. In Gibbs sampling, the partition p = {Ci, . . . ,C„(p)} 
and possibly the parameters 9* = {9\, . . . ,dn{p)) ^orm the state of the Markov chain. If Go is a 
conjugate prior for G, then 9* is not needed and the sampler is called "collapsed"; otherwise, 9* 
is included. Every iteration we choose an i sequentially, with 1 < i < n, and remove Si from 
the clustering. Then, we randomly assign it to either 1) one of the existing clusters, or 2) to a 
new cluster. The assignment probabilities are chosen such that the limiting distribution of the 
Markov chain is the posterior distribution 7r(p | We follow Algorithm 3 of |36j : this algorithm 

is designed for problems with base measure Go conjugate to the state conditional distribution G. 
However, efficient algorithms for non-conjugate base measures can also be found in [36j . 

Let Cj be the partition number for observed state Si and let ric be the number of observations 
in cluster /partition c. The partition p can readily be reconstructed from ci, . . . , c^; ci, . . . , c„ have 
the same relation to Ci, . . . , C„(p) that 9i, . . . ,9n have to 0*, ... , ^*(p)- That is, Gi = {j : cj = i}. 

Gibbs sampling repeatedly samples a Markov chain where the limiting distribution is the pos- 



terior distribution of Equation (12). The state of the chain is c = (ci, . . . , Cn)- We move around 
the space by changing the partition for one observed state Si probabilistically while holding the 
partitions of all the other states fixed. Let 

C_ j — (ci , . . . , Cj — 1 , Cj-|_l , . . . , Cn), 

and ri-i^c be the number of observations in partition c when i has been removed. Fixing c_j, we 
compute the transition probabilities for Cj as follows: 

P(cj = c|c_j, S,) = / g{Si\9*)dH^,,,{9''), if c = c,- for j / i, 

n + a J 

p(cj / Co V j / i|c_i, 5,) = b-^ [ g(Si\9*)dGoiO*), otherwise. 

n + a J 

Here, 6 is a normalizing constant, H^i^c{9*) is the posterior distribution conditioned on the base 
measure Go and set of observations {Sj : 9j = 9*, j i}. Because Go is conjugate a conjugate 
prior for g, H-i c{9*) has a closed form solution (see [H] for a comprehensive list of posterior 
forms). The chain is run until some convergence criteria are met. Convergence is notoriously hard 
to diagnose, but general rules of thumb include setting a very large number of "burn-in" iterations 
and discarding all observations before the burn-in or running a number of chains and comparing 
the within and between sequence parameter variance or posterior probability. See ^14j for a more 
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Algorithm 3: Gibbs sampler for p|S'i;n with a conjugate base measure 
Require: Observed states Si, ... , Sn. 



9 

10 
11 



InitiaHze c, set m = 1. 
while m < M do 

for i = 1 to n do 

Sample Cj|c_j using transition probabilities 

F{ci = c\c_i,Si) = / g{S,\6*)dH^.Ue*), c = c,-, j / i, 

n + a J 

P(ci / cj Vj|c_i,5i) = / g{Si\e*)dGo{e*), otherwise 

Ti ~\- (y. J 

end for 

if Convergence criteria are satisfied then 
Construct p from c, 

Ci = {j : Cj =i], i = 1, . . . , max(c). 

Set p(™) = p. 
Set m = m + 1. 
end if 
end while 



thorough discussion of convergence criteria. The Gibbs sampler for (p^™^)^=i is given in Algorithm 

El 

We now discuss under which conditions Dirichlet process weights satisfy the convergence con- 
ditions for stochastic search with a state variable. 

5.2.4 Convergence conditions 

Assumptions (AjsjS) and (A|4j4) concern the consistency of the regression estimator, which in turn 
depends on the underlying observation distribution and weights. Dirichlet process weights also 
produce a density estimate; weak consistency of this density estimate is enough to satisfy the 
assumptions. 

Posterior consistency is the notion that the posterior distribution of the DP mixture model in 



Equation (12) accumulates in neighborhoods "close" to the true distribution of the observations. 
For weak consistency, we would like it to accumulate in weak neighborhoods. 

Weak consistency for DPMMs depends on both the model and the base measure; it has been 
examined in numerous articles |2l [T5| \W[ [6T| \59\ 147] . Gaussian DPMMs with a conjugate base 
measure, here the Normal-Inverse Wishart, are weakly consistent for many continuous densities. 
See [15], [61], [59] and [U] for conditions. [16] also show that DPMMs are consistent for finite 
distributions provided that the base measure Go gives full support to the required probability 
simplex. 

Assumption (A[3j3) is satisfied if: 

1. The DPMM and base measures are weakly consistent for the true state distribution go{s). 

2. The distribution of F(x, s, Z) is weakly convergent in s for every x £ X. That is, F{x, s' , Z) 
F{x, s, Z) as s' — )• s, where "=>" denotes weak convergence. 



18 



3. F{x, s, Z) is almost surely bounded and continuous in x and s. 

These three conditions combine to produce a weakly convergent Bayes estimate of the conditional 
density, which is generated by the Dirichlet process weights on the observations. These are much 
heavier conditions than those for kernel-based weights. 

Because random sampling is required in all current weak consistency results for DPMMs, it is 
an open question under which conditions Dirichlet process weights satisfy assumption (A|4j4). We 
now study function-based optimization, gradient-based optimization and the weighting functions 
empirically. 



6 Empirical analysis 

We analyzed the performance of function-based and gradient-based optimization algorithms in 
conjunction with kernel- and Dirichlet-based weights on the hour ahead wind commitment problem 
and the two-product newsvendor problem. 



6.1 Multi-product constrained newsvendor problem 

A multi-product newsvendor problem is a classic operations research inventory management prob- 
lem |37j . In the two product problem, a newsvendor is selling products A and B. She must decide 
how much of each product to stock in the face of random demand, and Db- A and B can 
be be bought for {ca,cb) and sold for {pa-,Pb)-, respectively. Any inventory not sold is lost. Let 
{xaiXb) be the stocking decisions for A and B respectively; it is subject to a budget constraint, 
bAXA + ^B xb < and a storage constraint, taxa + tbxb < r. An observable state S = (^i, S2) 
contains information about Da and Db- The problem is, 

max - CyiXA - cbxb +IE [pyimin(xyi,DA) +PBmin(xB,-DB) l-S* = s] (21) 

subject to : bAX A + ^B xb <h, 
rA XA + rBXB < r. 



We generated data for Problem (21) in the following way. Demand and two state variables 
were generated in a jointly trimodal Gaussian mixture with parameters Da = 1/3 * [A^(10,4) + 
A^(28,5) -hiV(30,5)], Db = 1/3 * [iV(10, 3) + A^(22, 9) A^(35, 12)]; there were two state variables. 
Si and 5*2; parameters were also included in the Gaussian trimodal mixture; all parameters were 
generated as follows: fia,i ~ A^(0,3), a'^j) ~ Inverse Gamma{l, I), a = A, B, i = 1,2,3. 



6.1.1 Newsvendor Competitors 

The following methods were compared: 

1. Function-based with kernel. Bandwidth is selected according to the "rule of thumb" method 
of the np package for R, 

hj = imajn-^/^^+^l 
where aj is defined as min(sd, interquartile range/1.349) |22| . 

2. Gradient-based with kernel. Bandwidth selection was the same as in the function-based case; 
decisions were made online after an initialization period of 5 random decisions. 
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Figure 1: State distribution labeled by demand level (low, medium, high). 

3. Function-based with Dirichlet weights. We used the following hierarchical model, 

P~DP(q,Go), (22) 

Si,j\6i ~ N{nij,alj), j = 1,2. 

Conjugate base measures were used. Posterior samples were drawn using Gibbs sampling with 
a fully collapsed sampler run for 500 iterations with a 200 iteration burn-in with samples taken 
every 5 iterations. 



4. Gradient-based with Dirichlet weights. Dirichlet process mixture model was as in Model 22 
base measures and sampling procedures were the same. Decisions were made online after an 
initialization period of 5 random decisions. 

5. Optimal. These are the optimal decisions with known mixing parameters and unknown com- 
ponents. That is, we know that the distribution of Z^a is 1/3* [iV(10, 4) + 7V(28, 5) + A^(30, 5)], 
we have similar knowledge for Db, Si and 5*2 and we know their joint, but we do not know 
from which component (1, 2 or 3) the observation was drawn. Decisions were made by solving 
the unconstrained newsvendor problem, projecting onto the constraint set (if necessary) and 
then performing a boundary search until the the optimal decision was reached. 

6.1.2 Newsvendor Results 

Decisions were made under each regime over eight sample paths; 100 test state/demand pairs were 
fixed and decisions were made for these problems given the observed states/decisions in the sample 
path for each method. The state distribution is shown in Figure [6?T| Results are given in Figure 



6.1 The kernel and Dirichlet process weights performed approximately equally for each method. 
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Figure 2: Gradient-based and function-based methods as a function of number of data points 
sampled. Results are averaged over 100 test problems with observed demand. 

Function-based methods did better than gradient-based methods as the function-based methods 
used more available information. Performance of Dirichlet weights versus kernel weights depends 
mainly on the underlying state distribution; there are cases for both in which one is preferable to 
the other. 

6.2 Hour ahead wind commitment 

The details of the hour ahead wind commitment problem from Section [T] are as follows. A wind 
farm manager must decide how much energy to promise a utility an hour in advance, incorporating 
knowledge about the current state of the world. The decision is the amount of wind energy pledged, 
a scalar variable. If more energy is pledged than is generated, the difference must be bought on 
the regulating market, which is expensive with a price that is unknown when the decision is made; 
otherwise, the excess is lost. The goal is to maximize expected revenue. The observable state 
variable is the time of day, time of year, wind history from the past two hours, contract price and 
current regulating price, 



rpD 

i 


= time of day. 


Tr 


= time of year. 


pR 


= current spot price. 




= contract price. 


Will 


= wind speed an hour ago. 


Wi 


= current wind speed. 


Si 


= observable state variable 






Xi 


= amount of energy pledged. 




= Pfx - Pf^^ max (x - Wi+i,0) 



The revenue that the wind farm receives, depends on the variables P^i and M/^j+i, which 

are not known until the next hour. We used wind speed data from the North American Land Data 
Assimilation System with hourly observations from 2002-2005 in the following locations: 

1. Amarillo, TX. Latitude: 35.125 N, Longitude: 101.50 W. The data have strong daily and 
seasonal patterns. The mean wind level is 186.29 (m/s)^ with standard deviation 244.86. 
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2. Tehachapi, CA. Latitude: 35.125 N, Longitude: 118.25 W. The data have strong seasonal 
patterns. The mean wind level is 89.45 (m/s)^ with standard deviation 123.47. 

Clean regulating and contract price data for the time period were unavailable, so contract prices 
were generated by Gaussian random variables with a mean of 1 and variance of 0.10. Regulating 
prices were generated by a mean-reverting (Ornstein-Uhlenbeck) process with a mean function that 
varies by time of day and time of year [51]. The data were analyzed separately for each location; 
they were divided by year, with one year used for training and the other three used for testing. 



6.2.1 Wind Competitors 

The following methods were compared on this dataset: 

1. Known wind. The wind is known, allowing maximum possible commitment, Xi = 
It serves as an upper bound for all of the methods. 

2. Function-based with kernel weights. Function-based optimization where the weights are gen- 
erated by a Gaussian kernel. Bandwidth is selected according to the "rule of thumb" method 
of the np package for R, 

/i, = 1.06a,n-i/(4+'^), 
where aj is defined as min(sd, interquartile range/1.349) [22] ■ 

3. Function-based with Dirichlet process weights. Function-based optimization with Dirichlet 
process based weights. We model the state distribution with the following hierarchical model, 

P~L»P(a,Go), Oi\P ^ P, 

TflOi ~ von Mises(//i,D, (pn), T^l^i ~ von Mises(/ii,y , (/)y), 

Wi\0i ~ N{fii^wi,crlwi), Wi-i\9i ~ N{iii^w2,o-lw2)^ 

We modeled the time of day, TP , and year, TY, with a von Mises distribution, an expo- 
nential family distribution over the unit sphere; the dispersion parameters, cj)D and 0y, are 
hyperparameters. The base measure was Normal-Inverse Gamma for P^ , P^, Wi and Wi-i 
and uniform for the means of and T-^. 100 posterior samples were drawn using Gibbs 
sampling with a collapsed sampler for all conjugate dimensions after a 1,000 iteration burn-in 
and 10 iteration pulse between samples. 

4. Ignore state. Sample average approximation is used, 

n-l 



1 

F„{x\s) = - Vy^+i 



n 
i=0 
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Method/Location 2002 2003 2004 2005 

Tehachapi, CA 
Known Wind 
Function with Kernel 
Function with DP 
Ignore State 
Amarillo, TX 

Known Wind 186.0 175.2 184.9 175.2 

Function WITH Kernel 155.1 (83.4%) 149.6 (854%) 154.7 (83.7%) 146.2 (83.5%) 
Function with DP 168.2(90.4%) 160.6(91.7%) 167.1(90.4%) 159.4(91.0%) 

Ignore State 70.3 (37.8%) 68.7 (39.2%) 69.6 (37.6%) 66.1 (37.7%) 

Table 1: Mean values of decisions by method, year and data set. Percentages of the upper bound, 
Known Wind, are given for the other methods. 



6.2.2 Wind Results 

Results are presented in Table [T] We display the value of each algorithm, along with percentages of 
Known Wind for the other three methods. Function-based optimization with both types of weights 
outperformed the algorithm in which the state variable was ignored by a large margin (>45% of the 
best possible value). Dirichlet process weights outperformed kernel weights by a smaller but still 
significant margin (5.6-8.2% of best possible value). This is because the DP weights put substantial 
values on many more observations than kernel weights did in areas with high wind; in effect, kernel 
weights simply used the one or two closest observations in these areas for prediction. 

7 Discussion 

We presented new model-free methods to solve stochastic search problems with an observable state 
variable, function-based optimization and gradient-based optimization. We provided conditions for 
convergence for each. Both algorithms rely on weighting observations; we gave an easily imple- 
mentable weighting function (kernels) and a more complex weighting function (Dirichlet process 
based). Empirical analysis shows that Dirichlet process weights add value when the state variable 
distribution is moderate to high dimensional or has super- Gaussian tails; this was the case in the 
wind commitment problem. 

More generally, this work shows that statistics and machine learning can provide solutions 
to currently intractable search and optimization problems. Traditional search and optimization 
methods are designed to handle problems with large, complex decision spaces. However, there 
are many problems where complexity is derived from elements aside from the decision, such as an 
observable state variable. Statistics and machine learning offer an array of tools, such as clustering, 
density estimation, regression and inference, that may prove useful in solving problems that are 
now currently avoided. 
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